feat(q4_0): Panama SIMD kernel + reconcile MemSeg to split layout #649

Merged: michalharakal merged 3 commits into chore/resync-api-dumps from feature/q4_0-panama on May 30, 2026

@michalharakal

Contributor

Phase A, part 2. Stacked on #648.

What

  • PanamaVectorQ4_0MatmulKernel — JDK Vector API kernel (decode scale → unpack split-layout nibbles into scratch → SIMD-FMA). Wired via PanamaVectorKernelProvider.matmulQ4_0() (priority 50).
  • Latent-bug fix: the existing JVM MemSegment Q4_0 path used an interleaved nibble layout that doesn't match real GGUF Q4_0 weights. Reconciled dotQ4_0BlockMemSeg, Q4MemorySegmentTensorData (get/set/copyToFloatArray), and the test encoder to the canonical ggml split layout — so MemSeg now agrees with the heap type, the SPI kernels, and DequantOps.dequantQ4_0FromBytes.
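
To make the layout difference concrete, here is a minimal C sketch of the two nibble orderings for the 16 code bytes of one Q4_0 block. It is an illustration only, not the code from this PR: the function names are hypothetical, and the interleaved variant assumes the low nibble lands at index `2k` (the description only says codes `2k`/`2k+1` come from byte `k`).

```c
#include <stdint.h>

#define QK4_0 32 /* values per Q4_0 block */

/* Split layout (canonical ggml): byte k supplies code k from its low
 * nibble and code k + 16 from its high nibble. Codes are re-centred
 * from 0..15 to -8..7. */
static void unpack_split(const uint8_t qs[QK4_0 / 2], int8_t out[QK4_0]) {
    for (int k = 0; k < QK4_0 / 2; k++) {
        out[k]             = (int8_t)((qs[k] & 0x0F) - 8);
        out[k + QK4_0 / 2] = (int8_t)((qs[k] >> 4) - 8);
    }
}

/* Interleaved layout (the mismatched variant): byte k supplies codes
 * 2k and 2k + 1. Nibble order within the byte is an assumption here. */
static void unpack_interleaved(const uint8_t qs[QK4_0 / 2], int8_t out[QK4_0]) {
    for (int k = 0; k < QK4_0 / 2; k++) {
        out[2 * k]     = (int8_t)((qs[k] & 0x0F) - 8);
        out[2 * k + 1] = (int8_t)((qs[k] >> 4) - 8);
    }
}
```

Both functions read the same bytes and produce the same multiset of codes, only in a different order. That is why a path using the interleaved order for both encode and decode can be self-consistent yet still produce wrong results on weights written in the split layout.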

Behavior change

This changes the numerical output of the pre-existing Q4_0 MemSeg matmul path (it was self-consistent but mismatched vs ggml). That path had no callers in this repo and was unverified; the fix makes it correct for real Q4_0 weights.

Tests

  • PanamaVectorQ4_0MatmulKernelParityTest — scalar ≈ panama within FMA tolerance across matvec / attention / FFN shapes.
  • QuantizedMemSegMatmulTest — green under the corrected split layout.
  • apiCheck green (delta: PanamaVectorQ4_0MatmulKernel).
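
"Within FMA tolerance" reflects that a fused multiply-add rounds once where a separate multiply and add round twice, and that SIMD reductions sum in a different order than a scalar loop, so bitwise equality between kernels is not expected. A minimal sketch of such a comparison follows; the helper name and the absolute/relative tolerance parameters are hypothetical and are not taken from the parity tests in this PR.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true when every element of `actual` is within
 * atol + rtol * |expected[i]| of `expected`. The negated comparison
 * also makes any NaN fail the check. */
static bool approx_equal(const float *actual, const float *expected,
                         size_t n, float atol, float rtol) {
    for (size_t i = 0; i < n; i++) {
        float diff = actual[i] - expected[i];
        if (diff < 0.0f) diff = -diff;
        float mag = expected[i] < 0.0f ? -expected[i] : expected[i];
        if (!(diff <= atol + rtol * mag)) return false;
    }
    return true;
}
```

Mixing an absolute and a relative term keeps the check meaningful both for outputs near zero and for large accumulated sums.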

Targeting 0.27.0. Next: PR3 Native FFM.

🤖 Generated with Claude Code

michalharakal and others added 2 commits May 30, 2026 19:47
Adds `PanamaVectorQ4_0MatmulKernel` (JDK Vector API): per block, decode
the FP16 scale, unpack the 16 code bytes into 32 sign-corrected floats
in the canonical ggml split layout, then SIMD-FMA against the input
window. Wired through `PanamaVectorKernelProvider.matmulQ4_0()` (priority
50), so `DefaultCpuOpsJvm`'s `q4_0MatmulKernel` now prefers it over the
scalar floor on JDK 21+.
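
The per-block steps above (decode the scale, unpack to scratch, multiply-accumulate) can be sketched in scalar C as follows. This is an illustration under stated assumptions, not the Panama kernel: it assumes the standard ggml Q4_0 block of 18 bytes (a little-endian FP16 scale followed by 16 code bytes), all function names are hypothetical, and applying the scale once after the accumulation is a choice made here rather than something the commit message specifies.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define QK4_0 32            /* values per block */
#define Q4_0_BLOCK_BYTES 18 /* 2-byte FP16 scale + 16 code bytes */

/* Convert an IEEE 754 half-precision bit pattern to float. */
static float fp16_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    float f;
    if (exp == 0) {
        /* Zero or subnormal: man * 2^-24, exact in float. */
        f = (float)man * (1.0f / 16777216.0f);
        return sign ? -f : f;
    }
    uint32_t bits = (exp == 0x1Fu)
        ? (sign | 0x7F800000u | (man << 13))          /* Inf / NaN */
        : (sign | ((exp + 112u) << 23) | (man << 13)); /* rebias 15 -> 127 */
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Dot product of one Q4_0 block with 32 input floats. */
static float dot_q4_0_block(const uint8_t *block, const float *x) {
    float d = fp16_to_float((uint16_t)(block[0] | (block[1] << 8)));
    const uint8_t *qs = block + 2;

    /* Unpack into scratch in split layout: low nibbles -> 0..15,
     * high nibbles -> 16..31, re-centred to -8..7. */
    float scratch[QK4_0];
    for (int k = 0; k < QK4_0 / 2; k++) {
        scratch[k]             = (float)((qs[k] & 0x0F) - 8);
        scratch[k + QK4_0 / 2] = (float)((qs[k] >> 4) - 8);
    }

    /* Plain loop that a compiler can auto-vectorize. */
    float acc = 0.0f;
    for (int i = 0; i < QK4_0; i++) acc += scratch[i] * x[i];
    return acc * d;
}

/* Dot product of one weight row made of n_blocks consecutive blocks. */
static float dot_q4_0_row(const uint8_t *row, const float *x, size_t n_blocks) {
    float sum = 0.0f;
    for (size_t b = 0; b < n_blocks; b++)
        sum += dot_q4_0_block(row + b * Q4_0_BLOCK_BYTES, x + b * QK4_0);
    return sum;
}
```

A matvec is then one `dot_q4_0_row` call per output row. A vectorized kernel would replace the inner accumulation loop with lane-wise fused multiply-adds over the same scratch buffer.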

Also fixes a latent layout bug: the existing JVM MemSegment Q4_0 path
(`JvmQuantizedVectorKernels.dotQ4_0BlockMemSeg` and
`Q4MemorySegmentTensorData` get/set/copyToFloatArray) used an
*interleaved* nibble layout (code[2k]/[2k+1] from byte k), which does
NOT match real GGUF Q4_0 weights (split layout: low nibbles → 0..15,
high → 16..31, per `DequantOps.dequantQ4_0FromBytes`). This mismatch is
the likely reason the Q4_0 MemSeg path was never exercised end-to-end.
All three sites + the test encoder are reconciled to the split layout,
so the MemSeg path now agrees with the heap `Q4_0BlockTensorData`, the
scalar/Panama SPI kernels, and canonical ggml.

Tests: PanamaVectorQ4_0MatmulKernelParityTest (scalar≈panama within FMA
tolerance), QuantizedMemSegMatmulTest still green under split layout.
apiCheck green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Completes the Q4_0 kernel stack with a hand-written C kernel at priority
100. Adds native/src/q4_0_matmul.c (split-layout `(code - 8) * d` decode,
tight auto-vectorizing inner loop mirroring q8_0_matmul.c), declares
skainet_q4_0_matmul in skainet_kernels.h, and adds it to CMakeLists.

Kotlin side: NativeQ4_0MatmulKernel (FFM downcall, mirrors
NativeQ8_0MatmulKernel) wired through NativeKernelProvider.matmulQ4_0().
With the bundled libskainet_kernels loaded, KernelRegistry.bestAvailable()
now prefers native → Panama → scalar for Q4_0, same cascade as Q8_0/Q4_K.
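
The native → Panama → scalar cascade amounts to "pick the highest-priority provider that is currently available". The following C sketch models that selection rule only; it is not the `KernelRegistry` implementation. The struct and function names are hypothetical, and while the priorities 100 (native) and 50 (Panama) come from this PR, any value for the scalar floor (0 in the usage note) is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    const char *name;
    int priority;   /* higher wins */
    bool available; /* e.g. native library loaded, Vector API present */
} kernel_provider;

/* Returns the available provider with the highest priority,
 * or NULL when none is available. */
static const kernel_provider *best_available(const kernel_provider *ps, size_t n) {
    const kernel_provider *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (!ps[i].available) continue;
        if (best == NULL || ps[i].priority > best->priority) best = &ps[i];
    }
    return best;
}
```

With providers `{"scalar", 0}`, `{"panama", 50}` and `{"native", 100}` all available, the native one is selected. Marking it unavailable, as on a CI machine without the bundled library, makes the same call fall back to Panama and then to scalar.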

Verified locally (cmake build): NativeQ4_0MatmulKernelParityTest passes —
native output matches PanamaVectorQ4_0MatmulKernel within FMA tolerance
across matvec / attention / FFN shapes. CI without the native lib stays
green via the same availability gate the other native parity tests use.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(q4_0): native FFM kernel (skainet_q4_0_matmul)
Base automatically changed from feature/q4_0-core-format to chore/resync-api-dumps May 30, 2026 17:53
@michalharakal michalharakal merged commit e4b16f8 into chore/resync-api-dumps May 30, 2026
7 checks passed
@michalharakal michalharakal deleted the feature/q4_0-panama branch May 30, 2026 17:54